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Mining a large cohort of single-cell transcriptomics data, here we employ combinatorial 
optimization techniques to chart the landscape of optimal combination therapies in cancer. 
We assume that each individual therapy can target any one of 1269 genes encoding cell 
surface receptors, which may be targets of CAR-T, conjugated antibodies or coated nano- 
particle therapies. We find that in most cancer types, personalized combinations composed 
of at most four targets are then sufficient for killing at least 80% of tumor cells while sparing 
at least 90% of nontumor cells in the tumor microenvironment. However, as more stringent 
and selective killing is required, the number of targets needed rises rapidly. Emerging indi- 
vidual targets include PTPRZ1 for brain and head and neck cancers and EGFR in multiple tumor 
types. In sum, this study provides a computational estimate of the identity and number of 
targets needed in combination to target cancers selectively and precisely. 
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can be treated based on its genomic characteristics!~. 

Several trials have suggested that it is possible to collect 
genomics data fast enough to inform treatment decisions*->. 
Meta-analysis of Phase I clinical trials completed during 
2011-2013 showed that overall, trials that used molecular bio- 
marker information to influence treatment plans gave better 
results than trials that did not®. However, most precision oncol- 
ogy treatments utilize only one or two medicines, and resistant 
clones frequently emerge, emphasizing the need to deliver per- 
sonalized medicine as multiple agents combined®!!. Important 
opportunities to combine systems biology and design of nano- 
materials have been recognized to deliver medicines in combi- 
nation to overcome drug resistance and combine biological 
effects!?. 

The design of future precision oncology treatments is very 
challenging, given the growing recognition that tumors typically 
have considerable intra-tumor heterogeneity (ITH)!*-!4 and thus 
need to be targeted with a combination of medicines such that as 
many as possible tumor cells are hit by at least one medicine. Our 
analysis is based on two recently emerging technologies: (1) the 
advancement of single-cell transcriptomics and proteomics 
measurements from patients’ tumors, which are anticipated to 
enter gradually into clinical use!°, and (2) the introduction of 
“modular” treatments that target specific overexpressed genes/ 
proteins to recognize cells in a specific manner and then use 
either the T-cell immune response or a lethal toxin to kill the 
tumor cells preferentially. Here we briefly summarize previous 
related work on those treatments, before moving to present our 
approach. 

We focus our analysis on genes encoding receptors on the cell 
surface, as these may be precisely targeted by any of at least six 
technologies, including CAR-T therapy!®, immunotoxins ligated 
to antibodies!7:!8, immunotoxins ligated to mimicking peptides!?, 
and conventional chemotherapy ligated to nanoparticles*”. These 
treatments are all termed “modular”, as they include one part that 
specifically targets the tumor cell via a gene/protein overexpressed 
on its surface and another part, the cytotoxic mechanism that kills 
the cells. 

Two recent genome-wide analyses of modular therapies have 
focused on CAR-T therapy*!?, so we focus first on this tech- 
nology to put our work in context. In the original formulation, 
CAR-T therapy used one cell surface target that marks the cells of 
interest, such as CD19 as a marker for B cells. To date, CAR-T 
therapy has been effective in achieving remissions for some blood 
cancers!®3, but less effective for solid tumors. MacKay et al.? 
focused primarily on single targets and looked at combinations of 
two targets and did all analysis in silico. Dannenfelser et al.?! 
focused on predicting combinations of two and three targets and 
did most of their work in silico, with in vitro validation of two 
high-scoring predicted combinations in renal cancer. Impor- 
tantly, these studies analyzed bulk tumor and normal expression 
data to identify likely targets. Here, we present analysis that aims 
to identify modular targets based on the analysis of tumor single- 
cell transcriptomics instead of bulk transcriptomics. This enables 
one to study the research questions at a higher resolution but 
presents specific analytical challenges that need to be addressed. 

Two related difficulties with CAR-T therapy are (i) toxicity to 
non-cancer cells**> and (ii) difficulty in finding single targets 
that are sufficiently selective?*. To address the toxicity problem, 
MacKay et al.?7 selected 533 targets that had low expression in 
most tissues in the Genotype-Tissue Expression (GTEx) data; 
however, their analysis did not require that the targets are cell 
surface proteins. We proceed in a stepwise manner; we start with 
a formal analysis of a space of 1269 candidate cell surface 
receptors. Then, we add a low-expression requirement like that of 


P ersonalized oncology offers hope that each patient’s cancer 


MacKay et al.??. For completeness, we also tested their set of 
533 genes. 

To address the selectivity problem, various groups have engi- 
neered composite forms of CAR-T treatments that implement 
Boolean AND, OR, and NOT gates that have been tested for 
combinations of up to three target proteins**->°. Both MacKay 
et al. and Dannenfelser et al. presented in silico methods focusing 
on AND gates and pairs or trios of targets; Dannenfelser et al. 
analyzed 2538 likely cell surface proteins that are not necessarily 
receptors. We have chosen to focus on the simpler logical OR 
construction because that can be achieved not only by CAR-T 
technology?”8, but can also be implemented via other modular 
receptor-mediated treatment technologies by combining multiple 
single-target treatments, assuming that the composite treatment 
kills a cell if any one of the single treatments kill the cell. Con- 
ceptually, such a logical OR combination treatment can still 
achieve selectivity by choosing targets, each of which is expressed 
on a much higher proportion of cancer cells than non-cancer 
cells. One of our key contributions is to show that by using 
techniques from combinatorial optimization, one can find such 
effective combinations involving many targets, while previous 
studies were limited to at most three targets. 

Beyond CAR-T, our analysis applies to several additional types 
of modular treatment technologies that rely instead on receptor- 
mediated endocytosis (RME) delivering a toxin via a targeted 
receptor to enter the cell?!32, These RME-based technologies 
include, e.g., conjugated antibodies and toxin delivering nano- 
particles. Like CAR-T, these technologies do not downregulate 
the target receptor. For RME technologies and other technologies 
that work intracellularly, we anticipate combining modular 
treatments from one technology such that all treatments use the 
same toxin or mechanism of cell killing, thereby mitigating the 
need to test for interaction effects between pairs of different 
treatments. 

In this work, we aim to answer systematically and precisely two 
basic translational questions. First, how many targeted treatments 
are needed to selectively kill most tumor cells while sparing most 
of the non-tumor cells in a given patient? And second, given a 
cohort of patients to treat, how many distinct single-target 
treatments need to be prepared beforehand to treat each patient 
effectively with the per-patient minimum number of targeted 
treatments? To address these questions, in this study we design 
and implement a computational approach named MadHitter 
(after the Mad Hatter from Alice in Wonderland) to identify 
optimal precision combination treatments that target membrane 
receptors (Fig. 1A—C). We define three key parameters related to 
the stringency of killing the tumor and protecting the non-tumor 
cells and explore how the optimal treatments vary with those 
parameters (Fig. 1B, C, Supplementary Fig. 1). Solving this pro- 
blem is analogous to solving the classical “hitting set problem” in 
combinatorial algorithms**, which is formally defined in the 
“Methods” section (see also Supplementary Note 1 for an 
extensive introduction to related past work). Unlike the previous 
studies on CAR-T targets, we define the problem in a persona- 
lized manner, intending that each patient will get optimal treat- 
ments for their tumor from among a collection of treatments 
available at the cohort level. 


Results 

The data and the combinatorial optimization framework. We 
focused our analysis searching for optimal treatment combina- 
tions in nine single-cell RNA-seq data sets that include tumor 
cells and non-tumor cells from at least three patients that were 
publicly available at the onset of our investigation (“Methods”; 
Table 1). Those data sets include four brain cancer data sets and 
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Fig. 1 Conceptual example of MadHitter analysis of single-cell data transcriptomics from three cancer patients. A A cohort of patients arrives for a 
study; single-cell tumor microenvironment (TME) transcriptomics data are collected from each patient; the data are analyzed with MadHitter and for each 
patient an optimal personalized combination of targeted therapies is selected from a pre-specified set (pill bottle). MadHitter aims to select targeted 
therapies that are modular, having a recognition unit that is gene/protein-specific, and a common killing subunit (for all gene targets). Icons of four 
modular therapies are shown; we focus on the three for which the target protein must be on the cell surface and the two for which it must be a receptor, 
and we only mention degraders. Algorithm parameters are denoted near the MadHitter icon in panel A and explained in later panels. B The single-cell data 
are in two matrices with genes as rows and cells as columns, partitioned into tumor (T) and non-tumor (N) cells. The expression ratio r determines how 
much a gene must be overexpressed for a cell to be targeted. A gene is considered ‘overexpressed’ if its expression is at least r times the mean, reference 
level among cells with non-zero expression. The matrices on the right side show a Boolean representation of which targets kill which cells, based on the 
expression values in matrix B and taking r = 2. Accordingly, the combination of EGFR and KDR would kill all tumor cells and would spare all non-tumor cells. 
C MadHitter seeks a minimum-size combination of targets that would kill many tumor cells and few non-tumor cells, in a patient-specific manner. The /b 
and ub parameters are the lower bound on the fraction of tumor cells killed and the upper bound on the fraction of non-tumor cells whose killing is 
tolerated, respectively. Baseline settings are r = 2, lb = 0.8, and ub = 0.1, and can be varied. The right side of (C) shows a hypothetical example of the 
tradeoff between killing tumor cells and sparing non-tumor cells. MadHitter would select target set B since only it satisfies both baseline settings. 


one each from head and neck, melanoma, lung, breast and colon 
cancers. Most analyses were done for all data sets, but for clarity 
of exposition, we focused the main text analyses on four data sets 
from four different cancer types (brain, head and neck, mela- 
noma, lung) that are larger than the other five and hence, make 
the optimization problems more challenging. The results for the 
other five data sets are provided in the Supplementary Notes. 


To formalize our questions as combinatorial optimization 
hitting set problems, we define the following parameters and 
baseline values and explore how the optimal answers vary as 
functions of these parameters: We specify a lower bound on the 
fraction of tumor cells that should be killed, /b, which ranges from 
0 to 1. Similarly, we define an upper bound on the fraction of 
non-tumor cells killed, ub, which also ranges from 0 to 1. To be 
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Table 1 Summary descriptions of single-cell data sets from solid tumors used either for analysis (9) or preliminary testing (5 
additional) and one liquid tumor data set used for validation (1 additional). 


Data set code Cancer type(s) 


GSE75688 
GSE118389 
GSE89567 
GSE103224 
GSE70630 
GSE57872 
GSE102130 
GSE84465 
GSE81861 
GSE103322 
GSE115978 
GSE118828 


441 (11) 
804 (6) 
5097 (10) 
23793 (8) 
4044 (6) 
440 (6) 
2858 (9) 
1091 (4) 
272 (10) 
2093 (13) 
2018 (23) 


Breast 

Breast 

Brain (glioma) 

Brain (glioma) 

Brain (glioma) 

Brain (glioma) 

Brain (6 glioma and 3 glioblastoma) 
Brain (glioblastoma) 
Colorectal 

Head and neck 
Melanoma 

Ovarian 


GSE67980 
E-MTAB-6149 
GSE132509 


124 (21) 
7351 (5) 


Prostate 

Lung 

Acute lymphoblastic 
leukemia (ALL) 


Cancer cells (patients) 


1415 (11 primary) 
973 (5 metastasis) 


21329 (6 B-ALL) 


Non-cancer cells 
(patients) 


Clinical follow-up 


Metastasis or not 


314 (6) Metastasis or not 


1146 (9) No 
- No 
303 (6) No 


= No 
94 (5) No 
651 (4) No 
160 (6) No 
3197 (15) No 
4334 (32) Yes, immuno-therapy 
578 (2) No 


- Metastasis or not 
2730 (5) No 
4744 (6 B-ALL) No 


Data sets are ordered so that those from the same or similar tumor types are on consecutive rows. The first 13 and 15th data sets were obtained either from GEO or the Broad Institute Single Cell Portal, 
but the GEO code is shown. The data set on the 14th row was obtained from ArrayExpress. In some data sets that have both cancer and non-cancer cells, there may be patients for whom only one type or 
the other is provided. Hence, the numbers in parentheses in the third and fourth columns may differ. In most cases, the number of patients we analyzed is the minimum of the two numbers in 
parentheses; for example, for GSE103322, we analyzed data on 13 patients, which is the minimum of (13) and (15). Data set GSE115978°! supersedes and partly subsumes GSE72056%. For the blood 
cancer data set, we used only the six samples from B-cell acute lymphoblastic leukemia (ALL) patients and report only those cells as counted in the version at (http://tisch.comp-genomics.org); the data 
set also contains samples from two patients with T-cell ALL and three healthy donors that we did not analyze. 


concrete, our baseline settings are Jb = 0.8 and ub = 0.1 but we 
also explore other values, and the approach (and code) are generic 
and can explore the landscape of solutions at any other settings 
deemed of interest. Therapies such as conjugated antibodies and 
nanoparticles are designed to recognize and kill cells that 
overexpress a target protein on the cell surface. To model 
quantitatively the overexpression associated with recognition of 
cells overexpressing the target, we introduced a parameter, r, that 
designates the ratio of the mean overexpression of a given target 
in cancer vs non-cancer cells within the tumor microenviron- 
ment. The expression ratio r defines which cells are killed, as 
follows (Fig. 1B): Denote the mean expression of a gene g in non- 
cancer cells that have non-zero expression by E(g). A given cell is 
considered killed if gene g is targeted and its expression level in 
that cell is at least rx E(g). Higher values of r thus model more 
selective killing. 

Having r as a modifiable parameter anticipates that in the 
future one could experimentally tune the overexpression level (see 
Supplementary Note 1) at which cell killing occurs”!. In this 
respect, technologies that rely on RME to get a toxin into the cell 
are particularly tunable because there is known to be a non-linear 
relationship, called “superselectivity” between the number of 
protein copies on the cell surface and the probability that RME 
occurs successfully*+. In these technologies, the toxin or other 
therapy delivered by the modular treatment enters cells in a gene- 
specific manner*°. CAR-T therapy activates T-cell killing against 
cells in a gene-specific manner?! 2, 

The question of how sharp a non-linear superselective jump in 
ligand-receptor binding as a function of surface density of 
receptors is achievable in a nanoparticle system has been the 
focus of both experimental and theoretical studies for nanopar- 
ticles and CAR-T, as reviewed in Supplementary Note 1. The 
inverse of the slope of this binding function at the transition is 
one way to estimate a realistic value of r. For nanoparticles, an 
increase of a factor of at most 2 in receptor density/expression 
appears to be sufficient to go from almost no binding to almost 
perfect binding and this justifies our baseline choice of r = 2.0. 
For CAR-T, investigation of the binding curve has only recently 


started, and the first key study achieved experimentally a receptor 
density ratio of 5-10 between almost no binding and almost 
perfect binding. These CAR-T experiments were a proof-of- 
principle study using combinatorial logic encoded genetically and 
it seems likely that more sophisticated CAR-T circuitry can 
achieve steeper binding curves corresponding to lower values of r. 
Therefore, for most of our analyses, the expression ratio r is 
varied from 1.5 to 3.0, with a baseline of 2.0, based on 
experiments in the lab of N.A. and related to combinatorial 
chemistry modeling*4; in one analysis, we varied r up to 5.0 
(Supplementary Notes). For the sake of modularity and 
applicability across technologies, we model a framework in which 
each nanoparticle has multiple copies of one ligand attached and 
the distinct nanoparticles act in a logical OR fashion. Combining 
superselectivity with a statistical mechanics analysis, Tian and 
colleagues validated an alternative logical OR framework in which 
multiple ligands are attached to a single nanoparticle; impor- 
tantly, they validated this heterogeneous logical OR framework 
using ligands for the proteins encoded by the human genes LRP1 
and SCARB1°°. 

Given these definitions, we solve the following combinatorial 
optimization hitting set problem (“Methods”): Given an input of 
a single-cell transcriptomics sample of non-tumor and tumor 
cells for each patient in a cohort of multiple patients, bounds ub 
and Ib, ratio r, and a set of target genes, we seek a solution that 
includes a minimum-size combination of targets in each 
individual patient, while additionally minimizing the size of all 
targets given to the patient cohort. The latter is termed the global 
minimum-size hitting set (GHS) in computer science terminology 
or the cohort target set (CTS) in terminology specific to our 
problem, while the optimal hitting set of genes targeting one 
patient is termed the individual target set (ITS). This optimum 
hitting set problem with constraints can be solved to optimality 
using integer linear programming (ILP) (“Methods”, Supplemen- 
tary Fig. 2). We solve different optimization problem instances, 
each of which considers a different set of candidate target genes: 
1269 genes encoding cell surface receptor proteins, and subset of 
58 out of these 1269 genes that already have published ligand- 
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mimicking peptides, and a nested collection of sets of 424-900 
out of the 1269 genes that are lowly expressed across normal 
human tissues below a series of decreasing gene expression 
thresholds*?. From a computational standpoint, there is no 
inherent theoretical limit on the size of the candidate gene set, but 
in practice, such optimization tasks may become intractable as 
this size increases. Our formulation is personalized as each 
patient receives the minimum possible number of treatments. The 
global optimization comes into play only when there are multiple 
solutions of the same size to treat a patient. For example, suppose 
we have two patients such that patient A could be treated by 
targeting either {EGFR, FGFR2} or {MET, FGFR2} and patient B 
could be treated by targeting either {EGFR, CD44} or {ANPEP, 
CD44}. Then we prefer the CTS {EGFR, FGFR2, CD44} of size 3 
and we treat patient A by targeting {EGFR, FGFR2} and patient B 
by targeting {EGFR, CD44}. 

As the number of cells per patient varies by three orders of 
magnitude across data sets, we use random sampling to obtain 
hitting set instances of comparable sizes that adequately capture 
tumor heterogeneity. We found that sampling hundreds of cells 
from the tumor is sufficient to get enough data to represent all 
cells. In most of the experiments shown, the number of cells 
sampled, which we denote by c, was 500. In some smaller data 
sets, we had to sample smaller numbers of cells (“Methods”). As 
shown in (Supplementary Note 2, Supplementary Figs. 2, 3), 500 
cells, when available, are roughly sufficient for CTS size to plateau 
for our baseline parameter settings, Jb = 0.8, ub = 0.1,r = 2.0. 
The results using all cells and default parameters are shown in 
Supplementary Table 1 and are similar to the results using 
sampling, where we consider the latter more informative for 
future studies. Hence, for each individual within a data set, we 
performed independent sampling of c cells 20 times and their 
results were summarized. 


Cohort and individual target set sizes as functions of tumor 
killing and non-tumor sparing goals. Given the single-cell 
tumor data sets and the ILP optimization framework described 
above, we first studied how the resulting optimal cohort target set 
(CTS) may vary as a function of the parameters defining the 
optimization objectives in different cancer types. Figure 2 and 
Supplementary Figs. 4-8 in Supplementary Note 3 show heat- 
maps of CTS sizes when varying /b, ub, and r around the baseline 
values of 0.8, 0.1, and 2.0, respectively. The CTS sizes for mela- 
noma were largest, partly due to the larger number of patients in 
that data set (Table 1). Indeed, as we sampled subsets of 5 or 10 
patients uniformly and observed that the mean CTS sizes grew 
from 7.9 (5 patient subsets) to 12.3 (10 patient subsets) to 31.0 (all 
patients, as shown in Fig. 2). 

Encouragingly, for most data sets and parameter settings, the 
optimal CTS sizes are in the single digits. However, in several data 
sets, we observe a sharp increase in CTS size as /b values are 
increased above 0.8 and/or as the ub is decreased below 0.1, with 
a more pronounced effect of varying /b. This transition is more 
discernable at the lowest value of r (1.5), probably because when r 
is lower, it becomes harder to find genes that are individually 
selective in killing tumor cells and sparing non-tumor cells 
(Supplementary Figs. 4-8). The qualitative transition observed in 
CTS sizes occurs robustly regardless of the threshold for filtering 
out low expressing cells when preprocessing the data (Supple- 
mentary Note 4, Supplementary Figs. 9-11). 

We next examined what are the resulting individual target set 
(ITS) sizes obtained in the optimal combinations under the same 
conditions. In all data sets, the mean ITS sizes are in the single 
digits for most values of lb and ub. The distributions of ITS sizes 
are shown for four data sets and two combinations of (Jb, ub) 


(Fig. 3) and for additional data sets in Supplementary Note 5 
(Supplementary Fig. 12). Overall, the mean ITS sizes with the 
baseline parameter values (r= 2.0,/b=0.8,ub=0.1) range 
from 1.0 to 3.91 among the nine data sets studied (Supplementary 
Table 3); on average 4 targets per patient should hence suffice if 
enough single-target treatments are available in the cohort target 
set. However, there is considerable variability across patients. 

Evidently, as we make the treatment requirements more 
stringent (by increasing /b from 0.8 to 0.9 and decreasing ub from 
0.1 to 0.05), the variability in ITS size across patients became 
larger. Importantly, this analysis provides rigorous quantifiable 
evidence grounding the prevailing observation that among 
tumors of the same type, some individual tumors may be much 
harder to treat than others. Taken together, these results show 
that we can compute precise estimates of the number of targets 
needed for cohorts (in the tens) and individual patients (in the 
single digits usually) and that these estimates are sensitive to the 
killing stringency, especially when the /b increases above 0.8. The 
variation for more aggressive killing regimes, with values of /b up 
to 0.99 for the baseline r = 2.0 is displayed in Supplementary 
Figs. 13 and 14 in Supplementary Note 6. For fixed /b = 0.8, ub = 
0.1 and varying r, to values as high as 5.0, smallest CTS sizes are 
typically obtained for r values close to 2.0, further motivating our 
choice of r= 2 as the default value (Supplementary Note 7, 
Supplementary Figs. 15 and 16, Supplementary Table 2). Finally, 
we show that, as expected, a ‘control’ greedy heuristic algorithm 
searching for small and effective target combinations finds ITS 
sizes substantially larger than the optimal ITS sizes identified 
using our optimization algorithm (Fig. 3). The greedy CTS size is 
greater than the ILP optimal CTS size for eight out of nine data 
sets (Supplementary Table 3 in Supplementary Note 8, “Meth- 
ods”). The comparison between heuristic and optimal solutions is 
of interest because it quantifies the benefit of finding the 
optimum-size ITS and CTS and the previous related studies 
reviewed in the Introduction used only heuristic methods for 
solution sizes above 2 or 3, respectively?**°. 


The landscape of combinations achievable with receptors cur- 
rently targetable by published ligand-mimicking peptides or 
those tested in CAR-T trials. To get a view of the combination 
treatments that are possible with receptor targets for which there 
are already existing modular targeting reagents, we conducted a 
literature search identifying 58 out of the 1269 genes with pub- 
lished ligand-mimicking peptides that have been already tested in 
in vitro models, usually cancer models (“Methods”; Tables 3 
and 4). We asked whether we could find feasible optimal com- 
binations in this case and if so, how do the optimal CTS and ITS 
sizes compare vs. those computed for all 1269 genes? 

Computing the optimal CTS and ITS solutions for this basket 
of 58 targets, we found feasible solutions for six of the data sets 
across all parameter combinations we surveyed and three of these 
six are illustrated for each patient in Fig. 4. However, for three 
data sets, in numerous parameter combinations we could not find 
optimal solutions that satisfy the optimization constraints 
(Supplementary Notes 9, Supplementary Figs. 17-19). That is, 
the currently available targets do not allow one to design 
treatments that may achieve the specified selective killing 
objectives, underscoring the need to develop new targeted cancer 
therapies, to make personalized medicine more effective for more 
patients. 

Overall, comparing the optimal solutions obtained with 58 
targets to those we have obtained with the 1269 targets, three 
qualitatively different behaviors are observed (Supplementary 
Notes 9, Supplementary Figs. 17-19): (1) In some data sets, it is 
just a little bit more difficult to find optimal ITS and CTS 
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Fig. 2 Heatmaps showing how the cohort target set (CTS) size varies as a function of Ib, ub, r, and across data sets. For each plot the x-axis and y-axis 
represent /b and ub parameter values, respectively. The scale on the right shows the cohort target set sizes by color scale. We show separate plots for 
r = 2.0,3.0 here and a larger set {1.5, 2.0, 2.5, 3.0} in Supplementary Note 3. Individual values are not necessarily integers because each value represents 
the mean of 20 replicates of sampling c (500 for each of the data sets shown here) cells (Supplementary Fig. 2). Source data are provided as a Source 


data file. 


solutions with the 58-gene pool, while in others, the restriction to 
a smaller pool can be a severe constraint making the optimization 
problem infeasible. (2) The smaller basket of gene targets may 
force more patients to receive similar individual treatment sets 
and thereby reduces the size of the CTS. (3) Unlike the CTS size, 
the ITS size must stay the same or increase when the pool of 
genes is reduced, because we find the optimal ITS size for each 
patient. Overall, the average ITS sizes across each cohort using the 
pool of 58 genes for baseline settings range from 1.16 to 4.0. 
Among cases that have any solution, the average increases in the 
ITS sizes at baseline settings in the 58 genes case vs. that of the 
1269 case were moderate, ranging from 0.16 to 1.33. 

We performed a similar analysis with 57 gene targets that were 
listed as the genes encoding single proteins in CAR-T trials 
(Supplementary Table 4; see one visualization of the results in 
Supplementary Fig. 20)?2. However, using our baseline parameter 
settings and either with cell sampling or analyzing all cells 
together only three out of nine data sets have feasible solutions 
because the majority of the CAR-T target genes were filtered out 
by the researchers who collected the primary data. Those include 
brain cancer GSE89567 (mean CTS 2.95, ITS 1), brain cancer 
GSE102130 (mean CTS 4, ITS 2), and breast cancer GSE118389 
(mean CTS 7, ITS 3). This limited success suggests that, to cover 
more cancer indications, CAR-T targets that are more differen- 
tially expressed in solid tumors need to be developed and that it 
may be necessary to target multiple receptors simultaneously?”°. 

The success of CAR-T targeted to CD19 for B-cell acute 
lymphocytic leukemia (B-ALL)?” and the availability of one 
single-cell B-ALL data set (GSE132509, “Methods”)?® allowed us 
to assess further whether our 0.8 killing threshold for malignant 
cells reasonably captures the success observed in a human trial. 
After filtering out cells expressing <20% of genes and using only 
the cells consistently annotated as malignant in the six B-ALL 


samples, we found that a mean of 80.3% per patient of malignant 
cells express CD19 (“Methods”). This result is consistent with the 
0.8 baseline value of Jb we have used throughout the paper. The 
annotations of the non-malignant cells were not sufficiently 
consistent to assess the 0.1 baseline value of ub. Notably, blood 
cancers are fundamentally different from solid tumors. A 
different line of support for the choice of the 0.8 threshold is 
provided in “Methods”. 


Optimal fairness-based combination therapies for a given 
cohort of patients. Until now we have adhered to a patient- 
centered approach that aims to find the minimum-size ITS for 
each patient, first and foremost. We now study a different, 
cohort-centered approach, where given a cohort of patients, we 
seek to minimize the total size of the overall CTS size, while 
allowing for some increase in the ITS sizes. The key question is 
how much larger are the resulting ITS sizes if we optimize for 
minimizing the cohort (CTS size), rather than the individuals 
(ITS size)? This challenge is motivated by a ‘fairness’ perspective 
(Supplementary Note 1), where we seek solutions that are bene- 
ficial for the entire community from a social or economic per- 
spective (in terms of CTS size) even if they are potentially sub- 
optimal at the individual level (in terms of ITS sizes). Here, the 
potential benefit is economic since running a basket trial would 
be less expensive if one reduces the size of the basket of available 
treatments (Fig. 5A, B). 

We formalized this ‘fair CTS problem’ by adding a cost 
parameter « that specifies the limit on the excess number of (ITS) 
targets selected for any individual patient, compared to the 
number selected in the individual-based approach that was 
studied up until now (formally, the latter corresponds to setting 
a = 0). We formulated and solved via ILP this fair CTS problem 
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Fig. 3 The distribution of optimal and greedy individual target set (ITS) size values in four different cancer types. We study both our baseline 


parameter setting (upper row panels) and a markedly more stringent one (middle row plots). For the more stringent parameter setting, we compare the ITS 
sizes obtained using MadHitter (middle row plots) and a greedy algorithm that tries to add pairs of genes at a time (bottom row plots). In each plot the 
patients are assigned letters on the x-axis and the ITS size is on the y-axis. In each of the 12 plots, the patients are sorted from left to right in increasing 
order of the mean ITS values in the optimal stringent (Ib = 0.9, ub = 0.05) regime. For each patient and each parameter setting, we generated 20 replicates 
by sampling 500 cells and the box plot shows the 20 data points and their mean. For most patients in the head and neck and melanoma data sets, there is 
no variation in the ITS value among the 20 replicates and the 20 circles are hence overplotted, but as evident, for two patients each in the brain and lung 
cancer data sets, there is considerable variation in the ITS sizes between replicates. Additional comparisons between ITS sizes at different parameter 

settings can be found in Supplementary Note 5. A description of the greedy algorithm and more comparisons between the optimal and greedy algorithms 
are provided in Supplementary Note 8. All boxplots are drawn with the ggplot2 library function geom_boxplot, which shows the median as a horizontal line, 
puts the hinges at 25th and 75th percentiles and puts the whiskers at 1.5x Interquartile-Range from the nearest hinge. Source data are provided as a Source 


data file. 


for up to 1269 possible targets on all nine data sets (“Methods”). 
We fixed r = 2 and ub = 0.1 while varying a and Ib. Figure 5C 
and Supplementary Figs. 21-25 in Supplementary Note 10 show 
the optimal CTS and ITS sizes for a = 0,...,5. 

For 8 out of 9 data sets, we encouragingly find that the 
unfairness cost parameter a is bounded by a constant of 3; ice., it 
is sufficient to increase a by no more than 3 to obtain the smallest 
CTS sizes in the optimally fair solutions. For the largest data set 
(melanoma), « = 4. As we show in Supplementary Note 10, 
empirically, even if one requires lower a values, then as those 
approach 0, the size of the fairness-based CTS grows fairly 
moderately and remains in the lower double digits, and the mean 
size of the number of treatments given to each patient (their ITS) 
is overall <5. Theoretically, we show that one can design instances 
for which a would need to be at least ./n — 1 to get a CTS of size 
less than the overall number of targets n (Supplementary 
Note 10). However, in practice, we find that given the current 
tumor single-cell expression data, fairness-based treatment 
strategies are likely to be a reasonable economic option in the 
future. 


The landscape of optimal solutions targeting differentially 
expressed tumor receptors that are lowly expressed across 
many healthy tissues. We turn to examine the space of optimal 
solutions when restricting the set of eligible surface receptor gene 
targets to those that have lower expression across many non- 
cancerous human tissues (“Methods”), aiming to mitigate 
potential damage to tissues unrelated to the tumor site. To this 
end, we selected subsets of the 1269 cell surface receptor targets in 
which the genes have overall lower expression across multiple 
normal tissues, by mining GTEx and the Human Protein Atlas 
(HPA) (“Methods”). Varying the selectivity expression thresholds 
(expressed in transcripts per million (TPM)) used to filter out 
genes whose mean expression across the normal adult tissues is 
above values of 10, 5, 2, 1, 0.5, and 0.25 (i.e., employing more and 
more extensive filtering as this threshold is decreased), decreases 
the size of the target cell surface receptor gene list by more than 
half (Table 2). 

As shown in Fig. 6A, B (and Supplementary Figs. 26-28), 
MadHitter identifies very different cohort target sets (which are 
larger than the original optimal solutions, as expected) as the 
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Fig. 4 Comparison of individual target set sizes with 1269 or 58 targets for three out of the six data sets that have feasible solutions. We attempted to 
find feasible solutions for all patients using 58 cell surface receptors that have published ligand-mimicking peptides that have been tested in vitro or in pre- 
clinical models. There are feasible solutions for all patients in six data sets, but not for the brain (GSE84465), melanoma (GSE115978), and lung (E-MTAB- 
6149), which were displayed in previous figures. Instead, we show here results for breast and colorectal cancers, for which other analyses, such as those in 
Figs. 2 and 3, are in the Supplementary Notes. Some of the optimal solutions obtained on the 58-receptors restricted set are of the same size to those 
obtained on the whole receptors set and some are larger. For each data set, we sampled 20 replicates of either 250 cells (breast), 100 cells (colorectal) or 
500 cells (head and neck). Except for the 58-gene analysis of head and neck patient J (with ITS size 1 in 6 replicates, size 2 in 11 replicates, size 3 in 3 
replicates), each patient had the same ITS size in each replicate. All boxplots are drawn with the ggplot2 library function geom_boxplot, which shows the 
median as a horizontal line, puts the hinges at 25th and 75th percentiles and puts the whiskers at 1.5x Interquartile-Range from the nearest hinge. Source 


data are provided as a Source data file. 


TPM selectivity threshold value is decreased. Furthermore, 
different ITS instances may become infeasible (Supplementary 
Fig. 29). At an individual patient level, using lower selectivity 
threshold levels, which leads to a smaller space of membrane 
receptors to choose from, also leads to increased mean ITS sizes 
(Supplementary Figs. 30, 31). Across the nine data sets, the 
selectivity threshold at which the CTS problem became infeasible 
varied (Supplementary Fig. 29). The differences observed could be 
the result of expression heterogeneity of the cancer, number of 
patients within the data set, size of target gene set, lack of 
expression of available gene targets, and other unknown factors. 
In the future, further experimentation is required to identify 
tissue-specific optimal gene expression thresholds that will 
minimize side effects while allowing cancer cells to be killed by 
combinations of targeted therapies. Finally, for completeness, we 
also tested MadHitter on the set of 533 lowly expressed genes 
suggested by MacKay et al.?° All instances with default setting of 
r, lb, ub have feasible solutions for all patients. Mean ITS sizes are 
below 4 for eight of nine data sets, but close to 10 for the brain 
cancer data set GSE84465. More details can be found in 
Supplementary Note 11 and Supplementary Table 5. 


Key targets composing optimal solutions across the space of 
1269 membrane receptors. To identify the genes that occur most 
often in optimal solutions for our baseline settings, since there 
may be multiple distinct optimal solutions composed of different 
target genes, we sampled up to 50 optimal solutions for each 
optimization instance solved and recorded how often each gene 
occurs and how often each pair of genes occur together 
(“Methods”). We analyzed and visualized these gene (co-)occur- 
rences in three ways. First, we constructed co-occurrence circos 
plots in which arcs around the circle represent frequently 
occurring genes and edges connect targets that frequently co- 


occur in optimal CTS solutions. Figure 6C-F shows the co- 
occurrence visualizations for optimal CTS solutions obtained with 
the original, unfiltered target space of 1269 genes and in baseline 
parameter settings. The genes frequently occurring in optimal 
solutions are quite specific and distinct between different cancer 
types. In melanoma, the edges form a clique-like network because 
virtually all optimal solutions include the same clique of 12 genes 
(Fig. 6E). The head and neck cancer data set has only one 
commonly co-occurring pair {GPR87, CXADR}, partly because 
the CTS sizes are much smaller, ranging from 2 to 5, and 
the optimal solutions of sizes larger than 2 vary substantially in 
the choices of the other gene(s) (Fig. 6D). The choices of these 
two genes are not obvious as CXADR ranks only 10th by our 
measure of gene overexpression (Supplementary Note 8) and 
GPR87 is not in the top 20 genes. Of the cancer types not depicted 
in Fig. 6, the breast cancer data set has a commonly co-occurring 
set of size 4, {CLDN4, INSR, P2RY8, SORL} none of which ranks 
in the top 20 genes, and the colorectal cancer data set has a 
different commonly co-occurring set of size 4, {GABRE, GPRR, 
LGR5, PTPRJ}. These variations should be viewed as character- 
istic of different data sets, not of different tumor types, since we 
see different results for the four brain cancer data sets. 

We next tabulated sums of how often each gene occurred in 
optimal solutions for all nine data sets (Supplementary Note 12, 
Supplementary Tables 6, 7 and 8), obtained when solving for 
either 58 gene targets or 1269 gene targets. Strikingly, one gene, 
PTPRZ1 (protein tyrosine phosphatase receptor zeta 1), appears 
far more frequently than others, especially in three brain cancer 
data sets (GSE70630, GSE89567, GSE102130, Supplementary 
Table 8). PTPRZ1 also occurs commonly in optimal solutions for 
the head and neck cancer data set (Fig. 6D). The brain cancer 
finding coincides with previous reports that PTPRZ1 is over- 
expressed in glioblastoma (GBM)*?*°, PTPRZ1 also forms a 
fusion with the nearby oncogene MET in some brain tumors that 
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Fig. 5 A schematic example demonstrating the rationale and workings of fairness-based solutions. A, B Let us assume that each of three patients has 
two tumor cells (columns), each displaying five membrane receptors that are highly expressed only on the tumor cells and not on the non-tumor ones 
(rows). If we target {APP, MET} (panel A, cost unfairness parameter a = 1) in all patients, then this achieves a cohort target set (CTS) size of 2, which is 
the minimum possible. Employing the original individual-based optimizing objective, each patient could instead be treated by an individual target set (ITS) 
of size 1 by targeting the distinct receptors called Target 1 (specific to patient 1), Target 2, and Target 3, respectively, but this would result in an optimal 
CTS of size 3 (panel B, cost unfairness parameter a = 0). The solution in panel A has an unfairness value a = 1 because the worst difference among all 
patients is that a patient receives 1 more treatment than necessary. C Heatmaps showing how the CTS size varies as a increases (y-axis), starting from its 
baseline value of O where each patient is assigned a minimum-sizes individual treatment set (top row). The lower bound on tumor cells killed (x-axis) is 
also varied while the upper bound on non-tumor cells killed is kept fixed at 0.1. We are particularly interested in finding the smallest value on the y-axis at 
which the CTS size reaches its minimum value, which is circled for the baseline [b = 0.8, because this bounds the tradeoff between the achievable 
reduction in the number of targets needed to treat the whole cohort and the number of extra targets above the ITS minimum that any patient might need to 
receive. Source data for panel C are provided as a Source data file. 


have an overexpression of the fused MET*!. Notably, various cell 
line studies and mouse studies have shown that inhibiting 
PTPRZ1, for example by shRNAs, can slow glioblastoma tumor 
growth and migration***3, There have been some attempts to 
inhibit PTPRZ1 pharmacologically in brain cancer and other 
brain disorders*+*5, In the four brain cancer data sets, PTPRZ1 is 
expressed selectively above the baseline r=2.0 in 0.99 
(GSE89567), 0.84 (GSE70630), 0.96 (GSE102130), and 0.27 
(GSE84465) proportion of cells in each cohort. The much lower 


relative level of PIPRZ1 expression in GSE84465 is likely due to 
the heterogeneity of brain cancer types in this data set*°. Among 
the 58 genes with known ligand-mimicking peptides, EGFR 
stands out as most common in optimal solutions (Supplementary 
Table 6). Even when all 1269 genes are available, EGFR is most 
commonly selected for the brain cancer data set (GSE84465) in 
which PTPRZ1 is not as highly overexpressed (Fig. 6C). 
PTPRZ1 was the fifth most frequently occurring gene in 
optimal solutions for the head and neck cancer data set 
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Table 2 Size of high confidence target gene sets for different thresholds. 
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Fig. 6 Variation in the cohort target set (CTS) size and composition as function of the magnitude of filtering of genes expressed in non-cancerous 
human tissues, for different tumor types. A, B The number of times a gene (cell-surface receptor) is included in the CTS (out of 20 replicates, which is 
therefore the Max count in panels A, B), where each column presents the CTS solutions when the input target genes sets are filtered using a specific TPM 
filtering threshold (“Methods”), for (A) breast cancer and (B) brain cancer. These data sets were selected due to their relatively small cohort target set 
sizes, permitting their visualization. C-F Circos plots of the genes occurring most frequently in optimal CTS solutions (length of arc along the 
circumference) and their pairwise co-occurrence (thickness of the connecting edge) for the four main cancer types, in our original target space of 1269 
encoding cell-surface receptors. For each data set, we sampled up to 50 optimal CTS solutions. Co-occurrence representations of the 12 most common 
target genes out of 1269 encoding cell-surface receptors (with >5% frequency of occurrence) are represented in a cancer-specific manner for C brain 
cancer, (D) head and neck cancer (only seven genes have a frequency of 5% or more across optimal solutions), (E) melanoma, and (F) lung cancer. Genes 


and connections have distinct colors for improved visibility. Source data are provided as a Source data file. 


(GSE103322). The two most common genes by a large margin are 
CXADR and GPR87. CXADR has been studied primarily by 
virologists and immunologists because it encodes a receptor for 
cocksackieviruses and adenoviruses*’. In one breast cancer study, 
CXADR was found to play a role in regulating PTEN in the AKT 
pathway, but CXADR was underexpressed in breast cancer*® 
whereas it is overexpressed in the head and neck cancer data we 
analyzed. GPR87 is a rarely studied G protein-coupled receptor 
with an unknown natural ligand*?. In the context of cancer, 
GPR87 has previously been reported as overexpressed in several 


tumor types including lung and liver*? and its overexpression 
may play an oncogenic role via either the p53 pathway, the 
NFxB pathway?! or other pathways. 

Finally, we analyzed the set of genes in optimal solutions via 
the STRING database and associated tools°* to perform several 
types of gene set and pathway enrichment analyses. Supplemen- 
tary Figs. 32-35 (Supplementary Note 12) show STRING-derived 
protein-protein interaction networks for the 25 most common 
genes in the same four data for which we showed co-occurrence 
graphs in Fig. 6C-F. Again, EGFR stands out as being a highly 
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connected protein node in the solution networks for both the 
brain cancer and head and neck cancer data sets. Among the 30 
genes in the 1269-gene set that occur most commonly in optimal 
solutions (Supplementary Table 7), there are six kinases (out of 88 
total human transmembrane kinases with a catalytic domain, 
STRING gene set enrichment p<le — 6), namely {EGFR, EPHB4, 
ERBB3, FGFR1, INSR, NTRK2} and two phosphatases {PTPR], 
PTPRZ1}. The KEGG pathways most significantly enriched, all at 
FDR<0.005, are (“proteoglycans in cancer”) represented by 
{CD44, EGFR, ERBB3, FGFR1, PLAUR}, (“adherens junction”) 
represented by {EGFR, FGFR1, INSR, PTPRJ}, and (“calcium 
signaling pathway”) represented by {EDNRB, EGFR, ERBB3, 
GRPR, P2RX6}. The one gene in the intersection of all these 
pathways and functions is EGFR. 


Discussion 

In this multi-disciplinary study, we harnessed techniques from 
combinatorial optimization to analyze publicly available single- 
cell tumor transcriptomics data to chart the landscape of future 
personalized combinations that are based on ‘modular’ therapies, 
including CAR-T therapy. We showed that, for most tumors we 
studied, four modular medications targeting different over- 
expressed receptors may suffice to selectively kill most tumor 
cells, while sparing most of the non-cancerous cells (Figs. 2 and 3 
and Table S3). For the more restricted sets of low-expression 
genes** or the 58 receptors with validated ligand-mimicking 
peptides (Tables 3 and 4), some patients do not have feasible 
solutions, especially as we reduce the TPM expression used for 
filtering the gene set to avoid targeting non-cancerous tissues. 
These findings indicate, on one hand, that researchers designing 
ligand-mimicking peptides have been astute in choosing targets 
relevant to cancer. On the other hand, these results suggest that 
there is a need for extending the set of cell surface receptors that 
can be targeted to enter tumor cells with ligated chemotherapy 


Table 3 Single proteins that can be targeted by peptides 
based on references'®? and are expressed on the cell 
surface7>. 


Protein 


GPC3 
IL-IORA 
IL-11Ra 
IL-13Ra2 
IL-6Ra 
GP130 
VEGFR2 
MUC18 
Met 
MMpP9 
Thomsen- 
Friedenreich 
carbohydrate 
antigen 
NRP-1 
PDGFRB 
CD133 
PTPRJ 
HSPG 
E-selectin 
Tie2 
VPAC1 


Protein 


APN/CD13 
APP 

PD-L1 
CD44 
P32/gC1qgR 
E-cadherin 
N-cadherin 
CD21 
EGFR 
Epha2 
EphB4 


Gene symbol 


ANPEP 
APP 
CD274 
CD44 
CD93 
CDH1 
CDH2 
CR2 
EGFR 
EPHA2 
EPHB4 


Gene symbol 


GPC3 
ILIORA 
ILTIRA 
ILI3RA2 
IL6R 
IL6ST 
KDR 
MCAM 
MET 
MMpP9 
MUC1 


HER2 
FGFR1 
FGFR2 
FGFR3 
FGFR4 
VEGFR1 
VEGFR3 
PSMA 


ERBB2 
FGFR1 
FGFR2 
FGFR3 
FGFR4 
FLT1 
FLT4 
FOLH1 


NRP1 
PDGFRB 
PROMI 
PTPRJ 
SDC2 
SELE 
TEK 
VIPR1 


For easier correspondence with the gene expression data, the entries are listed in alphabetical 
order by gene symbol. The protein/gene in the two leftmost columns go together as do the 
protein/gene in the two rightmost columns. In this table, we follow the clinical genetics 
ormatting convention that proteins are in Roman and gene symbols are in italics. 


agents. There are two established methods to identify ligand- 
mimicking peptides called “one bead one compound” and “phage 
display””°, the latter of which was awarded the 2018 Nobel Prize 
in Chemistry. Both methods are applicable to all cell surface 
proteins and thus, we believe that the set of proteins with vali- 
dated peptides could be greatly expanded as nanoparticle-based 
treatments get closer to being used in the clinic. 

Remarkably, we found that if one designs the optimal set of 
treatments for an entire cohort adopting a fairness-based policy, 
then the size of the projected treatment combinations for indi- 
vidual patients are at most 3 targets larger, and in most data sets 
at most 1 target receptor larger than the optimal solutions that 
would have been assigned for these patients based on an 
individual-centric policy (Fig. 5, Supplementary Note 10). This 
suggests that the concern that the personalized treatment for any 
individual will be sub-optimal solely because that individual 
happens to have registered for a cohort trial appears to be tightly 
bounded. 

We had three principal motivations for treating each set of 
patients as a formal cohort. First and foremost, clinical trials are 
very expensive, and we are interested in what economies of scale 
can be found. Second, “basket trials”, such as the WINTHER 
trial5? and the UMIN000016794 trial°4, in which there are broad 
eligibility criteria and different patients receive different treat- 
ment based on an algorithm, have started; we think that such 
basket trials will be increasingly important to realize personalized 
oncology. Thirdly, the optimization problem that favors reusing 
gene targets for different patients may be of interest to algor- 
ithmically inclined readers, and may perhaps be useful in other 
applications. 

Like the study of MacKay et al.??, our study is a conceptual 
computational investigation. As in the studies of Dannenfelser 
et al.2! and MacKay et al.?”, we assumed that treatment selection 
will be based on measuring gene expression; neither we nor they 
considered the possibility of treatments that target mutant pro- 
teins. Our framework could accommodate mutant proteins, if 
mutations are called from the single-cell mRNA reads, but this is 
challenging since the reads usually do not cover full mRNAs. That 
our framework captures some general aspects of conjugated 
antibodies, nanoparticles, and CAR-T as distinct treatment 
technologies is on one hand a strength because its generality, and 
on the other hand a weakness because we ignore various 
technology-specific issues that require further study. A simple 
example of a distinction is that CAR-T can have any cell surface 
protein as a target while conjugated antibodies and nanoparticles 
require a cell-surface protein that is a receptor that functions by 
RME. A more complex example is that for ligand-mimicking 
peptides to work on nanoparticles one must experimentally 
measure various ligand-specific and receptor-specific parameters, 
including the possibility of non-specific binding of a ligand to 
unintended receptors*>>>. If these parameters were measured for 
a variety of ligands and receptors, then such data would open the 
possibility of a variety of heterogeneous nanoparticle designs 
including multiple ligands and decoy receptors being placed on 
single nanoparticles*>°°. Our methods and software present 
combinatorial optimization as one basis for addressing these 
more complex challenges, but other forms of mathematical 
optimization could be further helpful in identifying the most 
potent and selective nanoparticle designs from an exponentially 
large space of possibilities, considering more refined properties of 
the specific nanoparticle conjugates in hand. 

We have focused on developing methods based on analyzing 
single-cell transcriptomics data because it is the predominant 
current technology to collect abundant data on single cells of 
different types in a tumor’s microenvironment on a large scale. 
A key motivation for our study was the observation that single- 
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Table 4 Single proteins that can be targeted by ligand-mimicking peptides but are not included in the two principal reviews that 


we consulted!?:20 and are among 1269 cell surface receptors”>. 


Protein 


ActRIIB 

CD163 

CXCR4 

Ephrin A4 

Ephrin B1 

Ephrin B2 

Ephrin B3 

Gonadotrophin releasing hormone receptor 
G Protein-coupled receptor 55 

Bombesin receptor 2 

IL4 receptor 

Low-density lipoprotein receptor 

Leptin receptor 

LRP1 

Melanocortin 1 receptor 

Melanocortin 4 receptor 

CD206 

Urokinase plasminogen activator receptor 
Neurokinin-1 receptor 

VPAC2 


ACVR2B 
CD163 
CXCR4 
EPHA4 
EPHB1 
EPHB2 
EPHB3 
GNRHR 
GPRSS 
GRPR 
ILAR 
LDLR 
LEPR 
LRP1 
MCIR 
MCA4R 
MRC1 
PLAUR 
TACRI 
VIPR2 


Gene symbol 


At least one PubMed ID 


28955765 
27563889 

9482312, 22523575 
5681844, 22523575 
5722342, 22523575 
5722342, 22523575 
5722342, 22523575 
20814857, 22523575 
28029647 
20814857, 22523575 
9012727 

27656777 

9233229, 26265355 
29090274 
22964391 

7591746 

30768279 
25080049 
29498264 
30077368 


Since the evidence that these 20 genes have ligand-mimicking peptides is scattered in the literature, we include at least one PubMed ID of a paper describing a suitable peptide. Gene symbols are in 
italics because that is the convention in human genetics. 


cell transcriptomics is not yet used in the clinic, and our hope is 
that demonstrating its potential value for identifying tumor cells 
specific treatment targets may possibly facilitate the collection 
and analysis of such data by clinicians and bioinformaticians for 
advancing patients treatment. As such, our work is a starting 
point. In the future, numerous additional omics technologies 
should be brought in to deeply and carefully identify the best 
treatments, that also would need to be carefully tailored according 
to the specific treatment modality (as CAR-T, conjugate anti- 
bodies, nanoparticles, etc. each has its own properties). Those 
include transcriptomics data with spatial information®’, single- 
cell proteomics*®°°, and other immunoprofiling/imaging® 
methods, that may supplant transcriptomics as the method of 
choice for single-cell genomics data collection, as they become 
available on a large scale. 

We studied nine data single-cell expression data sets here, but 
it would be helpful to analyze more and larger data sets in the 
future. Even among four data sets of the same (brain) cancer type, 
we observed considerable variability in CTS and ITS sizes. Since 
our approach is general and the software is freely available as 
source code, other researchers can test the method on new data 
sets or add new variables, constraints, and optimality criteria. For 
example, we chose to analyze each single cell as a unit; many 
analyses of scRNA cancer data choose to cluster the cells first and 
may instead use cell clusters as the units of analysis. In that view, 
analysis of differential gene expression is usually done between 
clusters, while we preferred to compare the expression of each 
gene in each cell to the expression of the same gene in all non- 
cancer cells. Our investigation may lead others to further improve 
our method and broaden its applicability. In future work, we plan 
to apply our approach to study ways for selectively killing specific 
populations of immune cells, such as myeloid-derived suppressor 
cells or T regulatory cells, because they inhibit tumor killing, 
while sparing most other non-cancer cells. 

A limitation of our analysis on real data sets is that the non- 
cancer cells in the nine data sets we analyzed come from the 
tumor microenvironment. Our framework allows the input data 
to include non-cancer cells from any tissue in the patients 


consistent with the notion that CAR-T therapy is delivered sys- 
temically and can have on-target, off-tumor effects if other cell 
types (over)express the target protein. There are ongoing efforts 
in wet labs aiming to improve CAR-T therapy by combinatorial 
logic so that only cells overexpressing the target protein would be 
killed, as in our framework. This issue is covered in Supple- 
mentary Note 1. 

We compared gene expression levels between non-cancer cells 
and cancer cells sampled from the same patient, which avoids 
inter-patient expression variability®!. Following recent systematic 
efforts that have concluded that employing imputation in the 
analysis of single-cell RNA-seq data has no clear benefits as the 
zeroes observed in the data reflects both true gene expression and 
measurement error®?-®, we chose to avoid an imputation step in 
our study. Of course, others may take different decisions and 
include imputation in preprocessing the input data in future 
studies. As this investigation combines complex concepts from 
different scientific disciplines, we strived for simplicity in our 
modeling; therefore, we have taken the measured gene expression 
values as is, but future studies may consider extending our 
approach considering the stochastic nature of the expression of 
different genes measured *’, possibly preferring targets whose 
expression is more stable. 

Even though the combinatorial optimization problems solved 
here are in the worst-case exponentially hard (NP-complete*? in 
computer science terminology), the actual instances that arise 
from the single-cell data could be either formally solved to 
optimality or shown to be infeasible with modern optimization 
software on the NIH Biowulf system, which has a hard limit of 
240 wall-clock hours for any job. Of note, Delaney et al. have 
recently formalized a related optimization problem in analysis of 
single-cell clustered data for immunology*°. Their optimization 
problem is also NP-complete in the worst-case and they could 
solve sets of up to size four using heuristic methods*°. We have 
shown that the optimal ILP solutions we obtained are often 
substantially smaller than solutions obtained via a greedy heur- 
istic (Fig. 3, Supplementary Note 8 including Supplementary 
Table 3). 
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On the cautionary side, experiments with target gene sets that 
were further filtered by low expression in normal tissues showed 
that the individual target set problem can become infeasible in 
many instances. Even when the instance remained feasible, 
optimal cohort treatment set sizes increased rapidly as the 
expression levels allowed decreased (Fig. 6), pointing to potential 
inherent limitations of applying such combination approaches to 
patients in the clinic and the need to carefully monitor their 
putative safety and toxicity in future applications. Finally, func- 
tional enrichment analysis of genes commonly occurring in the 
optimal target sets reinforced the central role of the widely stu- 
died oncogene EGFR and other transmembrane kinases. We also 
found that the less-studied phosphatase PTPRZ1 is a useful target, 
especially in brain cancer. 

In summary, this study harnesses combinatorial optimization 
tools to analyze emerging single-cell data to portray the landscape 
of feasible personalized combinations in cancer medicine. Our 
findings uncover promising membranal targets for the develop- 
ment of future oncology medicines that may serve to optimize the 
treatment of cancer patient cohorts in several cancer types. The 
MadHitter approach and the accompanying software made public 
can be readily applied to address additional fundamentally related 
research questions and analyze additional cancer data sets as they 
become available. 


Methods 

Data sets. We retrieved and organized data sets from NCBI’s Gene Expression 
Omnibus (GEO)®* and Ensembl’s ArrayExpess® and the Broad Institute’s Single 
Cell Portal (https://portals.broadinstitute.org/single_cell). Nine data sets had suf- 
ficient tumor and non-tumor cells and were used in this study; an additional five 
data sets had sufficient tumor cells only and were used in testing early versions of 
MadHitter. Suitable data sets were identified by searching sCRNASeqDB”®, 
CancerSea’!, GEO, ArrayExpress, Google Scholar, the TISCH resource (http:// 
tisch.comp-genomics.org), and the 10x Genomics list of publications (https:// 
www.10xgenomics.com/resources/publications/). We required that each data set 
contain measurements of RNA expression on single cells from human primary 
solid tumors of at least two patients and the metadata are consistent with the 
primary data. We did not impose any sequencing depth threshold and we included 
data sets from different technologies (including both SmartSeq2 and 10x). We 
excluded some individual patients with relatively few (typically <25) non-tumor 
cells depending on the size of the data set. As shown in Table 1, we excluded early 
data sets that had exclusively tumor cells or almost exclusively tumor cells. To our 
knowledge, all samples analyzed were treatment-naive. The melanoma data set 
includes both pre-treatment and post-treatment samples from some patients, but 
we analyzed only the pre-treatment samples for overall consistency. We are grateful 
to several of the data depositing authors of data sets for resolving metadata 
inconsistencies by e-mail correspondence and by sending additional files not 
available at GEO or ArrayExpress. 

We excluded blood cancers and data sets with single patients. The only 
exception is that we used one blood cancer data set (GSE132509) for a specific test 
of how we parameterized the killing thresholds. When it was easily possible to 
separate cancer cells from non-cancer cells of a similar type, we did so. In the 
specific case of GSE132509, there were two cell type annotations in GEO and a 
third annotation at (http://tisch.comp-genomics.org) and the malignant cells were 
far more consistent than the non-malignant cells. Therefore, for only the one 
analysis of GSE132509, we limited computations to the malignant cells that were 
consistent in the three annotations. 

The main task in organizing each data set was to separate the cells from each 
sample or each patient into one or more single files. Representations of the 
expression as binary, as read counts, or as normalized quantities such as transcripts 
per million (TPM) were retained from the original data. When the data set 
included cell type assignments, we retained those to classify cells as “cancer” or 
“non-cancer”, except in the data set of Karaayvaz et al.’2 where it was necessary to 
reapply filters described in the paper to exclude cells expressing few genes and to 
identify likely cancer and likely non-cancer cells. To achieve partial consistency in 
the genes included, we filtered data sets to include only those gene labels recognized 
as valid by the HUGO (Human Genome Organization) Gene Nomenclature 
Committee (http://genenames.org), but otherwise we retained whatever recognized 
genes that the data submitters chose to include. After filtering out the non- HUGO 
genes, but before reducing the set of genes to 1269 or 900 or 424 or 58, we filtered 
out cells as follows. Some data sets came with low expressing cells filtered out. To 
achieve some homogeneity, we filtered out any cells expressing fewer than 10% of 
all genes before we reduced the number of genes. As an exception, for the blood 
cancer data set (GSE132509) we found it necessary to raise the threshold to 20% 
because the data set had been noticeably less filtered than the main nine data sets. 


In Supplementary Note 4, we tested the robustness of this 10% threshold. Finally, 
we retained either all available genes from among either our set of 1269 genes 
encoding cell-surface receptors that met additional criteria on low expression or 
available ligand-mimicking peptides. 


Sampling process to generate replicates of data sets. As shown in Table 1, the 
number of cells available in the different single-cell data sets varies by three orders 
of magnitude; to enable us to compare the findings across different data sets and 
cancer types on more equal footing, we employed sampling from the larger sets to 
reduce this difference to one order of magnitude. This goes along with the data 
collection process in the real world as we might get measurements from different 
samples at different times. Suppose for a data set we have n genes, and m cells 
comprising tumor cells and non-tumor cells. We want to select a subset of m! <m 
cells. We select a set of m’ cells uniformly at random without replacement from 
among all cells. Then we partition the selected cells into m, tumor cells and m', 
non-tumor cells to define one replicate. In most of the computational experiments 
shown we used 20 replicates and we report either the arithmetic mean or entire 
distribution of quantities such as the CTS size. 


Considering a previously defined set of target genes and of HPA gene 
expression across different normal tissues. The general aim of our methods is to 
target the cancer cells while sparing the adjacent non-cancer cells as much as 
possible. A related concern is that genes within the target set could be expressed at 
high levels in other normal tissues that are not part of the non-cancer cells from the 
tumor microenvironment included in the input data sets. One way to address this 
problem is to identify genes that have low expression in the majority of the tissues 
and to use them to obtain a target set. This approach has been pioneered in a recent 
paper on selecting gene targets suitable for CAR-T therapy”. The authors selected 
533 candidate genes that they judged could be reasonable targets for CAR-T. They 
made this selection based on expression data from the Human Protein Atlas’? and 
the Genotype-Tissue Expression consortium (GTEx)”4, which have expression 
information from multiple tissues which was used to identify low expressed 
target genes. 

McKay et al.?* used a threshold of 15 TPM units of expression (written in their 
work as log2(TPM-+1) < 4), but they allowed a small number of tissues to exceed 
this threshold. Instead, we used quantitative levels of expression for finer 
granularity in analysis, as described in the next subsection. One clinical difference 
is that we looked only at adult tissues because we are analyzing adult tumors, while 
CAR-T therapy can be used for either childhood or adult tumors. The reason to 
focus on cell-surface receptors, as suggested by Dannenfelser et al.?!, is that CAR-T 
therapy requires a cell-surface target that may or may not be a receptor, antibody 
technologies require a cell surface receptor, and the ligand-mimicking peptide 
nanotechnology that we summarized in the Introduction also requires cell surface 
receptor targets. 


Construction of target gene sets that are lowly expressed in normal tissues. 
To analyze the tissue specificity of the 1269 candidate target genes, the RNA-seq- 
based multiple tissue expression data was obtained from the Human Protein Atlas 
(HPA) database (https://www.proteinatlas.org/about/download; Date: May 29, 
2020). The HPA database includes expression values (in units of transcripts per 
million (TPM)) for 37 tissues from HPA (rna_tissue_hpa.tsv.zip)’> and 36 tissues 
from the Genotype-Tissue Expression consortium (rna_tissue_gtex.tsv.zip)”*. Next, 
to identify target genes with low or no expression within majority of adult human 
tissues, for the 1269 candidate genes we identified genes whose average expression 
across tissues is below certain threshold value (0.25, 0.5, 1, 2, 5, and 10 TPM) in 
both HPA and GTEx data sets. Using the intersection of low expression candidate 
genes from HPA and GTEx data sets, we generated lists of high confidence targets. 
The size of the resulting high confidence target genes varied from 424 (average 
expression <0.25 TPM) to 900 (average expression across tissue <10 TPM) genes 
(Table 2). While the total number of genes decreases slowly, the decrease is much 
steeper if one excludes olfactory receptors and taste receptors (Table 2). These 
sensory receptors are not typically considered as cancer targets, although a few of 
these receptors are selected in optimal target sets when there are few alternatives 
(Fig. 6). MadHitter was run on all nine data sets using the expression information 
from the high confidence gene lists. 


Assembling lists of membrane target genes. We are interested in the set of 
genes G that (i) have the encoded protein expressed on the cell surface and (ii) for 
which some biochemistry lab has found a small peptide (i.e., amino acid sequences 
of 5-30 amino acids) that can attach itself to the target protein and get inside the 
cell carrying a tiny cargo of a toxic drug that will kill the cell and (iii) encode 
proteins that are receptors. The third condition is needed because many proteins 
that reside on the cell surface are not receptors that can undergo RME. The first 
condition can be reliably tested using a recently published list of 2799 genes 
encoding human predicted cell surface proteins’°; we reduced the list to 1269 by 
requiring that the proteins be receptors, which is necessary for RME-based 
therapies but not for CAR-T therapy?!. For condition (ii), we found two review 
articles in the chemistry literature! that list targets effectively meeting this 
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condition. Intersecting the lists meeting conditions (i) and (ii) gave us 38 genes/ 
proteins that could be targeted (Table 3). 

Most of the data sets listed in Table 1 had expression data on 1200-1220 of 
these genes because the list of 1269 includes many olfactory receptor genes that 
may be omitted from standard genome-wide expression experiments. Among the 
38 genes in Table 3, 13/14 data sets have all 38 genes, but GSE57872 was 
substantially filtered and has only 10/38 genes; since GSE57872 lacks non-tumor 
cells, we did not use this data set in any analyses shown. 

Because the latter review2° was published in 2017, we expected that there are 
now additional genes for which ligand-mimicking peptides are known. We found 
20 additional genes and those are listed in Table 4. Thus, our target set analyses 
restricted to genes with known ligand-mimicking peptides use 58 = 38 + 20 
targets. 


Definition of the minimum hitting set problem and solution feasibility. One of 
Karp’s original NP-complete problems is called “hitting set” and is defined as 
follows*. Let U be a finite universal set of elements. Let S,,S,,S3,...,S, be subsets 
of U. Is there a small subset H C U such that for i= 1,2,3,...,k, 8; H is non- 
empty. In our setting, U is the set of target genes and the subsets S, are the single 
cells. In ref. 7°, numerous applications for hitting set and the closely related pro- 
blems of subset cover and dominating set are described; in addition, practical 
algorithms for hitting set are compared on real and synthetic data. 

Among the applications of hitting set and closely related NP-complete problems 
in biology and biochemistry are stability analysis of metabolic networks’7-*!, 
identification of critical paths in gene signaling and regulatory networks®?-*4, and 
selection of a set of drugs to treat cell lines8>.8° or single patients®”88, Much more 
information about past related work can be found in Supplementary Note 1. 

Two different difficulties arising in problems such as hitting set are that (1) an 
instance may be infeasible meaning that there does not exist a solution satisfying all 
constraints and (2) an instance may be intractable meaning that in the time 
available, one cannot either (i) prove that the instance is infeasible or (ii) find an 
optimal feasible solution. All instances of minimum hitting set that we considered 
were tractable on the NIH Biowulf system. Many instances were provably 
infeasible; in almost all cases. we did not plot the infeasible parameter 
combinations. However, in Fig. 4, the instance for the melanoma data set with the 
more stringent parameters was infeasible because of only one patient sample, so we 
omitted that patient for both parameter settings in Fig. 4. 


Basic optimal target set formulation. Given a collection S = {S,,5,,5;,...} of 
subsets of a set U, the hitting set problem is to find the smallest subset H C U that 
intersects every set in S. The hitting set problem is equivalent to the set cover 
problem and hence is NP-complete. The following ILP formulates this target set 
problem: 


min Xb x(g) 


Dd x(g)21 VS, € S 
ges; 


() 


In this formulation, there is a binary variable x(g) for each element g € Uthat 
denotes whether the element g is selected or not. Constraint (1) makes sure that 
from each set S; in S, at least one element is selected. 

For any data set of tumor cells, we begin with the model that we specify a set of 
genes that can be targeted, and that is U. Each cell is represented by the subset of 
genes in U whose expression is greater than zero. In biological terms, a cell is killed 
(hit) if it expresses at any level on one of the genes that is selected to be a target (i-e., 
in the optimal target set) in the treatment. In this initial formulation, all tumor cells 
are combined as if they come from one patient because we model that the 
treatment goal is to kill (hit) all tumor cells (all subsets). In a later subsection, we 
consider a fair version of this problem, taking into account that each patient is part 
of a cohort. Before that, we model the oncologist’s intuition that we want to target 
genes that are overexpressed in the tumor. 


Combining data on tumor cells and non-tumor cells. To make the hitting set 
formulation more realistic, we would likely model that a cell (set) is killed (hit) only 
if one of its targets is overexpressed compared to typical expression in non-cancer 
cells. Such modeling can be applied in the nine single-cell data sets that have data 
on non-cancer cells to reflect the principle that we would like the treatment to kill 
the tumor cells and spare the non-tumor cells. 

Let NT be the set of non-tumor cells. For each gene g, define its average 
expression E(g) as the arithmetic mean among all the non-zero values of the 
expression level of g and cells in NT. The zeroes are ignored because many of these 
likely represent dropouts in the expression measurement. Following the design of 
experiments in the lab of N. A., we define an expression ratio threshold factor r as a 
real number whose baseline value is 2.0. We adjust the formulation of the previous 
subsection, so that the set representing a cell (in the tumor cell set) contains only 
those genes g such that the expression of g is greater than rx E(g) instead of 
greater than zero. We keep the objective function limited to the tumor cells, but we 
also store a set to represent each non-tumor cell, and we tabulate which non-tumor 
cells (sets) would be killed (hit) because for at least one of the genes in the optimal 
target set, the expression of that gene in that non-tumor cell exceeds the threshold 


rx E(g). We add two continuous parameters /b and ub each in the range [0,1] and 
representing, respectively, a lower bound on the proportion of tumor cells killed 

and an upper bound on the proportion of non-tumor cells killed. The parameters 
Ib, ub are used only in two constraints, and we do not favor optimal solutions that 
kill more tumor cells or fewer non-tumor cells, so long as the solutions obey the 
constraints. 

More explicitly, if we let E(g, C) be the expression level of gene g in cell C, then 
g is in the set of genes that can cover C if and only if E(g, C) > rx E(g). We inspect 
this relationship for every pair of genes and cells to come up with the hitting set 
instance of our ILPs. 

Our choice of the lower bound (Jb) on the proportion of tumor cells killed in 
the range [0.7,0.9] is somewhat arbitrary at this stage, but it is motivated by the 
widely used response evaluation criteria in solid tumors (RECIST) revised 
guidelines version 1.1°°. Those criteria include the category of partial response 
(PR), which is usually classified together with complete response (CR) as a 
responder when one makes a dichotomy between response and non-response. PR 
is defined as “At least a 30% decrease in the sum of diameters of target lesions, 
taking as reference the baseline sum diameters.” Because volume is proportional to 
the cube of diameter, a 30% decrease in the diameter sum corresponds to a 
decrease of at least 66% in the volume, which we round up to 70% (0.7). Because 
the tumor mass contains both tumor cells and non-tumor cells and we want to kill 
the tumor cells disproportionately, we increased the baseline threshold for /b to 
80% (0.8) and considered values as high as 0.95 or even 0.99 in some 
supplementary analyses. In any case, /b is a user-controlled parameter and can be 
set as the user of our software deems appropriate. Since most cancer treatments 
have some toxicity, we set the baseline value of ub to the low non-zero value 0.1; in 
some analyses, we use the lower value 0.05. If one wanted to distinguish the 
different types of non-tumor cells, one could add different upper bound 
parameters for each cell type. 


The fair cohort target set problem for a multi-patient cohort. We want to 
formulate an integer linear program that selects a set of genes S* from available 
genes in such a way that, for each patient, there exists an individual target set 
H;* C S*of a relatively small size (compared to the optimal ITS of that patient 
alone which is denoted by H(i)). 

Let U= {g1, g --» Sui} be the set of genes, where |U| denotes the number of 
elements in the universe of genes U. There are n patients. For the ith patient, we 
denote by S,,;), the set of tumor cells related to patient i. For each tumor cell 
C € Spy, we describe it as a set of genes which is known to be targetable to cell C. 
That is, g € C if and only if a drug containing g can target the cell C. In the ILP, 
there is a binary variable x(g) corresponding to each gene g € U_ that shows 
whether the gene g is selected or not. There is a binary variable x(g, P(i)) which 
shows whether a gene g is selected in the target set of patient P(i). Let a be an 
adjustable positive integer, which we interpret as a slack for each patient’s solution. 
The objective function is to minimize the total number of genes selected, subject to 
having a target set of size at most H(i) + a for patient P(i) where 1<i<n 
(constraint (2)). 

Constraint (3) ensures that, for patient P(i), we do not select any gene g that are 
not selected in the global set. 

Constraint (4) ensures all the sets corresponding to tumor cells of patient P(i) 
are hit. 


min 2 +g) (2) 
Je snot a Vi (3) 
x(g, P(i)) <x(g) ViVg € U (4) 

28, PO) ZL VNC € Spi) (5) 


In this ILP, there is a variable corresponding to each gene. In addition, there is a 
variable per each pair of gene and patient, therefore, the total number of variables is 
|U|(n + 1). The total number of constraints in the form of constraints (3), (4), and 


n 
(5) are n, n|U|, and > |Spiy|, respectively. 
i=1 


Parameterization of the fair cohort target set problem. In the Fair Cohort 
Target Set ILP shown above, we give more preference towards minimizing number 
of genes needed in the CTS. However, we do not take into account the number of 
non-tumor cells killed. Killing (covering) too many non-tumor cells potentially 
hurts patients. To avoid that, we add an additional constraint to both the ILP for 
the local instances and the global instance. Intuitively, for patient P(i), given an 
upper bound of the portion of the non-tumor cell killed ub, we want to find the 
smallest cohort target set H(i) with the following properties: 


1. H(i) covers all the tumor cells of patient P(i). 

2. H(i) covers at most ub « |NTp,)| where NTp,, is the set of non-tumor cells 
known for patient P(i); the binary variable y(C) represents whether the cell 
C is covered. 
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The ILP can be formulated as follows: 


min 5 x(g) (6) 

= x(g)21 VC € Spi (7) 
WC)2 ingentg) YC S Nag (8) 
cot, WC) Sub * |NT p@| (9) 


In the abovementioned ILP, the total number of variables is 
|U| + |Sua| + [NT 49 
variable corresponding to each cell of patient P(i). The total numbers of constraints 
in the form of constraints (7), (8), and (9) are |Suo|; ae [.1u. and 1, 
respectively. In the actual ILP implementation, constraint (8) is in fact |U| different 
constraints for each C € NT yj). 

With this formulation, the existence of a feasible solution is not guaranteed. 
However, covering all tumor cells might not always be necessary either as we 
discussed above in the context of RECIST criteria for tumor response to treatment. 
Hence, we add another parameter /b to let us model this scenario. In the high-level, 


this is the ratio of the tumor cells we want to cover. The ILP can be formulated as 
follows: 


, where there is a variable corresponding to each gene and a 


min & x(g) (10) 

ce, y(C) 2 Ib * |Spo| (11) 

(C) = max x(g) YC € Spi) UNT py) (12) 
codes WC) sub * |NT p@| (13) 


Notice that constraint (11) here is different from the one above as we only care 
about the total number of tumor cells covered. In the above ILP, the total number 


of variables is |U| + [Saal + [NT 49 
each gene and a variable corresponding to each cell of patient P(i). The total 
numbers of constraints in the form of constraints (11), (12), and (13) are 

iE (Se + [NT piu, and 1, respectively. 

Even with both ub and 1b, the feasibility of the ILP is still not guaranteed. 
However, modeling the ILP in this way allows us to parameterize the ILP for 
various other scenarios of interest. While the two ILPs above are designed for one 
patient, one can extend these ILPs for multi-patient cohort. In a way that is similar 


to how we define x(g, P(i)), the binary variable y(C, P(i)) denotes whether the cell C 
is covered for the ith patient. 


, where there is a variable corresponding to 


min 5 x(g) (14) 

Pa P(i)) < H(i) + a Vi (15) 

x(g, Pli)) < x(g) Vi,g € U (16) 

AC, PO)> max x(g, PD) VVC € Sop (17) 
XO P(i)) = Ib * |Spyy Vi (18) 
Sh, WC, P(i)) < ub * [NT py|Vi (19) 


In the abovementioned ILP, the total number of variables is |U| + n|U| + 


n 
2(|Sm| + \n Tx0) where there is a variable corresponding to each gene, a 
i=l 


variable per each pair of gene and patient, and a variable corresponding to each cell 
of each patient P(i). The total numbers of constraints in the form of constraints 


(15), (16), (17), (18), and (19) are n, n|UI, ¥([Spo|-1UD; n, and n, respectively. 
i=l 


Implementation note, accounting for multiple optima and software and data 
availability. We implemented in Python 3 the above fair cohort target set for- 
mulations, with the expression ratio r as an option when non-tumor cells are 
available. The parameters a, /b, ub can be set by the user in the command line. To 
solve the ILPs to optimality we usually used the SCIP library and its Python 
interface*°, To obtain multiple optimal solutions of equal size we used the Gurobi 
library (https://www.gurobi.com) and its python interface. When evaluating 


multiple optima, for all feasible instances, we sampled 50 optimal solutions that 
may or may not be distinct, using the Gurobi function select_solution(). To 
determine how often each gene or pair of genes occur in optimal solutions, we 
computed the arithmetic mean of gene frequencies and gene pair frequencies over 
all sampled optimal solutions. 

The software package is called MadHitter. The main program is called 
hitting_set.py. We include in MadHitter a separate program to sample cells and 
generate replicates, called sample_columns.py. So long as one seeks only single 
optimal solutions for each instance, exactly one of SCIP and Gurobi is sufficient to 
use MadHitter. We verified that SCIP and Gurobi give optimal solutions of the 
same size. If one wants to sample multiple optima, this can be done only with the 
Gurobi library. The choice between SCIP and Gurobi and the number of optima to 
sample are controlled by command-line parameters use_gurobi and num_sol, 
respectively. 


Reporting summary. Further information on research design is available in the Nature 
Research Reporting Summary linked to this article. 


Data availability 

Input files for all nine data sets used in most of our analyses are available without 
restrictions and can be found at https://ftp.ncbi.nlm.nih.gov/pub/catSMA/ 
MadHitter_data. The input data processed for this study have also been deposited in 
Zenodo [https://zenodo.org/record/6038684#.YhTZIDXLc2x] and can be located via the 
https://doi.org/10.5281/zenodo.6038684. The data consist of lightly processed files 
starting from the publicly available sources in Table 1 augmented for some data sets by 
cell type classifications or other information provided by the authors. Two important 
aspects of the processing are that we homogenized the data format and separated the 
cancer cells and non-cancer cells into two files for each patient. The remaining data are 
available within the Article, Supplementary Information, or Source data file. Source data 
are provided with this paper. 


Code availability 

The MadHitter software is available on GitHub at https://github.com/ruppinlab/ 
madhitter and equivalently via the https://doi.org/10.5281/zenodo.6038389 [https:// 
zenodo.org/record/6038389]. The GitHub site includes enough data to test the software. 
More input data can be obtained as explained in Data availability. The code is written in 
python 3. The ILPs generated in MadHitter can be solved using either SCIP®? or Gurobi 
(www.gurobi.com). More details about the ILP packages can be found in the last 
subsection of “Methods” and in the file README.md on the GitHub site. 
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